fix(server): different graph names share the same backend #3027
Conversation
imbajin
left a comment
The direction of the fix is correct: prefixing the server ID with the graph namespace resolves the cross-graph key collision. However, there is one blocking logic defect plus missing test coverage; I suggest fixing both before merging.
The PR checkbox claims existing tests cover this, but a full search of hugegraph-server/hugegraph-test/ turns up no test for the "two graphs sharing one backend with the same nodeId" scenario.
Please add a regression test that at least verifies:
- When ~sys_graph and hugegraph are initialized with the same nodeId (e.g. "server-1"), their selfNodeId() return values differ
- Both graphs can start concurrently without throwing a parse error or an "already in cluster" exception
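The first check above can be sketched as a minimal, self-contained test. Here `namespacedId` is a hypothetical stand-in for the selfNodeId() composition logic, not the real ServerInfoManager API; a real regression test would start both graphs against a shared backend.

```java
// Sketch of the suggested regression check: two graphs, same bare nodeId,
// must yield distinct namespaced ids.
public class NamespacedIdTest {

    // Hypothetical helper mirroring the PR's "{graphname}/{nodeId}" scheme
    static String namespacedId(String spaceGraphName, String nodeId) {
        return spaceGraphName + "/" + nodeId;
    }

    public static void main(String[] args) {
        String sysId = namespacedId("DEFAULT-~sys_graph", "server-1");
        String defId = namespacedId("DEFAULT-hugegraph", "server-1");
        if (sysId.equals(defId)) {
            throw new IllegalStateException("ids must not collide");
        }
        System.out.println(sysId);
        System.out.println(defId);
    }
}
```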
In existing deployments, ServerInfo vertex keys in the backend use the format server-1. After the upgrade, the code queries with DEFAULT-hugegraph/server-1, cannot find the old record, and writes a fresh one; the old server-1 record is left behind in the backend as an uncleaned orphan. The functional impact is limited, but please call this out in the PR description or release notes to avoid confusing operators.
🧹 Minor suggestion: spaceGraphName character set
The value of spaceGraphName() is concatenated from user configuration and is never validated against the / character. If a user configures graphSpace as a string containing /, the generated ID will contain multiple / separators, which can make later parsing ambiguous. Consider adding an assertion, or documenting that graphSpace must not contain /.
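A possible shape for that guard, as a self-contained sketch (plain strings and IllegalArgumentException here; the real code would likely use E.checkArgument on the config value):

```java
public class GraphSpaceNameCheck {

    // Reject '/' so that "<graphSpace>/<nodeId>" ids stay unambiguous
    static void checkGraphSpace(String graphSpace) {
        if (graphSpace.contains("/")) {
            throw new IllegalArgumentException(
                    "graphSpace must not contain '/': " + graphSpace);
        }
    }

    public static void main(String[] args) {
        checkGraphSpace("DEFAULT");           // valid, passes silently
        try {
            checkGraphSpace("DEFAULT/extra"); // invalid, rejected
            System.out.println("unexpected: no exception");
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```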
```java
// Scope server id to graph to avoid cross-graph collision
return IdGenerator.of(this.graph.spaceGraphName() + "/" +
                      this.globalNodeInfo.nodeId().asString());
}
```
Mismatched read/write IDs in initServerInfo() (blocking issue)
With this change, selfNodeId() returns a namespaced ID (e.g. DEFAULT-hugegraph/server-1), but the existence check in the caller initServerInfo() still uses the raw bare ID, so the two paths use different keys:
```java
// initServerInfo() (untouched by this PR)
Id serverId = nodeInfo.nodeId();                    // ❌ bare ID: "server-1"
HugeServerInfo existed = this.serverInfo(serverId); // ❌ queried with "server-1"
// ...
this.globalNodeInfo = nodeInfo;   // line 140; selfNodeId() changes only after this
this.saveServerInfo(this.selfNodeId(), ...);        // ✅ saves "DEFAULT-hugegraph/server-1"
```
Consequences:
- The alive check at lines 109–122 (the guard against duplicate startup of a same-named node) can never find the already-namespaced records, so the guard is silently skipped
- On restart, the old namespaced record is not recognized; a new record is written directly, and the old one accumulates as orphan data in the backend
Suggested fix: assign globalNodeInfo at the top of initServerInfo(), then use selfNodeId() for every subsequent query so the read and write keys stay consistent:
```java
public synchronized void initServerInfo(GlobalMasterInfo nodeInfo) {
    E.checkArgument(nodeInfo != null, "The global node info can't be null");
    this.globalNodeInfo = nodeInfo;  // Assign first: selfNodeId() now returns the namespaced ID
    Id serverId = this.selfNodeId(); // ✅ Same key as the save/remove paths
    HugeServerInfo existed = this.serverInfo(serverId);
    if (existed != null && existed.alive()) {
        // ... alive check & wait logic unchanged ...
    }
    E.checkArgument(existed == null || !existed.alive(),
                    "The server with name '%s' already in cluster", serverId);
    // master uniqueness check unchanged ...
    this.saveServerInfo(serverId, this.selfNodeRole());
}
```
This way reads and writes always use the same ID, and the alive check and master-uniqueness guard actually take effect.
Codecov Report
✅ All modified and coverable lines are covered by tests.

```
@@            Coverage Diff             @@
##           master    #3027      +/-  ##
=============================================
+ Coverage    35.85%   93.25%   +57.40%
+ Complexity     338       65      -273
=============================================
  Files          802        9      -793
  Lines        67995      267    -67728
  Branches      8902       22     -8880
=============================================
- Hits         24381      249    -24132
+ Misses       41008        8    -41000
+ Partials      2606       10     -2596
```
imbajin
left a comment
+1, the key collision fix now scopes the server id by graph name and the regression test covers the core case. I added the upgrade-impact note for old local-scheduler tasks; the remaining compatibility edge is limited to unfinished non-PD/local tasks that existed before upgrade, so this looks acceptable to merge from my side.
Purpose of the PR
During backend initialization, the server creates two graphs, ~sys_graph and hugegraph (the default), that share the same store backend. This leads to a key collision: both graphs register their server info under the same bare id.
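The collision can be illustrated with a plain map standing in for the shared backend (illustrative only; real storage goes through the backend store API, and the graph-space names here are examples):

```java
import java.util.HashMap;
import java.util.Map;

public class KeyCollisionDemo {
    public static void main(String[] args) {
        Map<String, String> backend = new HashMap<>();

        // Both graphs register under the same bare id: the second put
        // overwrites the first, so only one record survives
        backend.put("server-1", "info-of-~sys_graph");
        backend.put("server-1", "info-of-hugegraph");
        System.out.println("bare keys stored: " + backend.size());

        backend.clear();

        // Namespaced ids keep the two records separate
        backend.put("DEFAULT-~sys_graph/server-1", "info-of-~sys_graph");
        backend.put("DEFAULT-hugegraph/server-1", "info-of-hugegraph");
        System.out.println("namespaced keys stored: " + backend.size());
    }
}
```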
Main Changes
- ServerInfoManager.selfNodeId() now returns "{graphname}/server-1" instead of the previous "server-1"
Upgrade impact
This change namespaces the server id by graph name, so old unfinished tasks in the non-PD local scheduler may still reference the previous bare server id, for example server-1. Those historical tasks can remain visible, but they may not be restored or cancelled by the new namespaced server id after upgrade. The impact is limited to unfinished local-scheduler tasks that already existed before upgrading; newly scheduled tasks use the new namespaced id. To avoid this compatibility edge case, finish or cancel pending local tasks before upgrading.
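The compatibility edge boils down to a key mismatch at lookup time, sketched below (a map stands in for the task store, and the task name is a made-up example; the real path goes through the local task scheduler):

```java
import java.util.HashMap;
import java.util.Map;

public class UpgradeLookupDemo {
    public static void main(String[] args) {
        // Pre-upgrade state: an unfinished task bound to the bare server id
        Map<String, String> tasksByServer = new HashMap<>();
        tasksByServer.put("server-1", "unfinished-local-task");

        // Post-upgrade code looks up with the namespaced id and misses,
        // so the old task cannot be restored or cancelled
        String hit = tasksByServer.get("DEFAULT-hugegraph/server-1");
        System.out.println("post-upgrade lookup: " + hit);
    }
}
```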
Verifying these changes
Does this PR potentially affect the following parts?
Documentation Status
- Doc - TODO
- Doc - Done
- Doc - No Need